Generalizing, Decoding, and Optimizing Support Vector Machine Classification
The classification of complex data usually requires the composition of several processing steps. Here, a major challenge is the selection of optimal algorithms for preprocessing and classification. Nowadays, parts of the optimization process are automated, but expert knowledge and manual work are still required. We present three steps to address this challenge and ease the optimization. Namely, we take a theoretical view of classical classifiers, provide an approach to interpret the classifier together with the preprocessing, and integrate both into one framework that enables a semi-automatic optimization of the processing chain and that interfaces with numerous algorithms.
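To make the idea of jointly optimizing a processing chain concrete, here is a minimal sketch using scikit-learn's Pipeline and GridSearchCV as a stand-in; the paper's own framework is not reproduced here, and the chosen preprocessing steps and parameter grid are illustrative assumptions only.

```python
# Illustrative stand-in only: a processing chain (preprocessing + SVM)
# whose steps are optimized jointly, in the spirit of the semi-automatic
# optimization the abstract describes.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Processing chain: preprocessing steps followed by the classifier.
chain = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA()),
    ("clf", SVC()),
])

# Semi-automatic optimization: search over parameters of *all* steps at once,
# rather than tuning preprocessing and classifier in isolation.
grid = GridSearchCV(
    chain,
    param_grid={
        "reduce__n_components": [5, 10, 15],
        "clf__C": [0.1, 1.0, 10.0],
        "clf__kernel": ["linear", "rbf"],
    },
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```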
Packing: Towards 2x NLP BERT Acceleration
We find that, at sequence length 512, padding tokens represent in excess of 50% of the Wikipedia dataset used for pretraining BERT (Bidirectional Encoder Representations from Transformers). Therefore, by removing all padding we achieve a 2x speed-up in terms of sequences/sec. To exploit this characteristic of the dataset, we develop and contrast two deterministic packing algorithms. Both algorithms rely on the assumption that sequences are interchangeable, so packing can be performed on the histogram of sequence lengths rather than per sample. This transformation of the problem leads to algorithms which are fast and have linear complexity in the dataset size. The shortest-pack-first histogram-packing (SPFHP) algorithm determines the packing order for the Wikipedia dataset of over 16M sequences in 0.02 seconds. The non-negative least-squares histogram-packing (NNLSHP) algorithm converges in 28.4 seconds but produces solutions which are more depth-efficient, achieving near-optimal packing by combining at most 3 sequences in one sample. Using the dataset with multiple sequences per sample requires additional masking in the attention layer and a modification of the MLM loss function. We demonstrate that both of these changes are straightforward to implement and have relatively little impact on the achievable performance gain on modern hardware. Finally, we pretrain BERT-Large using the packed dataset, demonstrating no loss of convergence and the desired 2x speed-up.
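A simplified sketch of the shortest-pack-first idea, reconstructed from the abstract alone: sequences are processed via the length histogram, longest first, and each goes into the tightest open pack that still fits it. The function name spfhp, the histogram format, and details such as the absence of a cap on sequences per pack are assumptions, not the paper's reference implementation.

```python
# Simplified sketch of shortest-pack-first histogram packing (SPFHP),
# reconstructed from the abstract; the reference implementation may differ.
from collections import defaultdict

def spfhp(histogram, max_len=512):
    """histogram: dict {sequence_length: count}.
    Returns a dict mapping each pack 'strategy' (tuple of lengths)
    to how many packs use it."""
    open_packs = defaultdict(list)  # remaining space -> list of partial packs
    strategies = defaultdict(int)

    for length in sorted(histogram, reverse=True):  # longest sequences first
        for _ in range(histogram[length]):
            # Shortest-pack-first: pick the smallest remaining space
            # that can still accommodate this sequence.
            fits = [r for r in open_packs if r >= length and open_packs[r]]
            if fits:
                r = min(fits)
                pack = open_packs[r].pop()
            else:
                r, pack = max_len, []  # open a fresh pack
            pack.append(length)
            open_packs[r - length].append(pack)

    # Collect all packs, open or full, into counted strategies.
    for packs in open_packs.values():
        for pack in packs:
            strategies[tuple(pack)] += 1
    return strategies

# Example: two full-length sequences and two half-length ones.
print(spfhp({512: 2, 256: 2}))  # {(512,): 2, (256, 256): 1}
```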
Tuple Packing: Efficient Batching of Small Graphs in Graph Neural Networks
When processing a batch of graphs in machine learning models such as Graph Neural Networks (GNNs), it is common to combine several small graphs into one overall graph to accelerate processing and to remove or reduce the overhead of padding. This is supported, for example, in the PyG library. However, the sizes of small graphs can vary substantially in the number of nodes and edges, and hence the size of the combined graph can still vary considerably, especially for small batch sizes. Therefore, the costs of excessive padding and wasted compute are still incurred when working with static shapes, which are preferred for maximum acceleration. This paper proposes a new hardware-agnostic approach -- tuple packing -- for generating batches that cause minimal overhead. The algorithm extends recently introduced sequence packing approaches to work on the 2D tuples of (|nodes|, |edges|). A monotone heuristic is applied to the 2D histogram of tuple values to define a priority for packing histogram bins together, with the objective of reaching a limit on the number of nodes as well as the number of edges. Experiments verify the effectiveness of the algorithm on multiple datasets.
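The 2D extension can be sketched as follows: each graph consumes from two budgets at once, a node budget and an edge budget, and a pack must respect both. Sorting by |nodes| + |edges| is one simple monotone surrogate for the paper's ordering heuristic, and pack_graphs is a hypothetical helper, so this illustrates the idea rather than the paper's exact algorithm.

```python
# Hedged sketch of tuple packing: shortest-pack-first extended to
# 2D (nodes, edges) budgets, ordered by a simple monotone key.
def pack_graphs(sizes, max_nodes, max_edges):
    """sizes: list of (num_nodes, num_edges) tuples, one per graph.
    Returns packs as lists of indices into `sizes`."""
    order = sorted(range(len(sizes)),
                   key=lambda i: sizes[i][0] + sizes[i][1], reverse=True)
    packs = []      # list of lists of graph indices
    remaining = []  # parallel list of (nodes_left, edges_left)
    for i in order:
        n, e = sizes[i]
        best = None
        for j, (rn, re) in enumerate(remaining):
            # A pack fits only if BOTH budgets still accommodate the graph;
            # among those, prefer the pack with the least total slack.
            if rn >= n and re >= e:
                if best is None or rn + re < sum(remaining[best]):
                    best = j
        if best is None:
            packs.append([i])
            remaining.append((max_nodes - n, max_edges - e))
        else:
            packs[best].append(i)
            rn, re = remaining[best]
            remaining[best] = (rn - n, re - e)
    return packs

# Example: pack four small graphs under static (nodes, edges) limits.
print(pack_graphs([(10, 30), (4, 8), (6, 20), (3, 6)],
                  max_nodes=16, max_edges=64))
```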
Extreme Acceleration of Graph Neural Network-based Prediction Models for Quantum Chemistry
Molecular property calculations are the bedrock of chemical physics. High-fidelity ab initio modeling techniques for computing molecular properties can be prohibitively expensive, motivating the development of machine-learning models that make the same predictions more efficiently. Training graph neural networks over large molecular databases introduces unique computational challenges, such as the need to process millions of small graphs with variable size and to support communication patterns that are distinct from learning over large graphs such as social networks. This paper demonstrates a novel hardware-software co-design approach to scale up the training of graph neural networks for molecular property prediction. We introduce an algorithm to coalesce the batches of molecular graphs into fixed-size packs, which eliminates the redundant computation and memory associated with alternative padding techniques and improves throughput by minimizing communication. We demonstrate the effectiveness of our co-design approach by providing an implementation of a well-established molecular property prediction model on Graphcore Intelligence Processing Units (IPUs). We evaluate the training performance on multiple molecular graph databases with varying graph counts, sizes, and sparsity. We demonstrate that such a co-design approach can reduce the training time of molecular property prediction models from days to less than two hours, opening new possibilities for AI-driven scientific discovery.
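The fixed-size pack idea can be illustrated with a small NumPy sketch: once a pack of molecular graphs has been chosen (e.g., by a tuple-packing algorithm as above), its graphs are written into statically shaped node and edge tensors so an accelerator compiled for fixed shapes never recompiles. The dictionary fields and function name here are hypothetical; the paper's IPU implementation is more involved, and a production version would typically route padded edge slots to a dedicated dummy node rather than node 0.

```python
# Minimal sketch (NumPy, hypothetical field names) of flattening one pack
# of small graphs into fixed-shape tensors for a static-shape accelerator.
import numpy as np

def pack_to_fixed(graphs, max_nodes, max_edges):
    """graphs: list of dicts with 'x' (nodes, feat) and 'edge_index' (2, edges).
    Returns node features, edge index, and a node mask, all statically shaped."""
    feat = graphs[0]["x"].shape[1]
    x = np.zeros((max_nodes, feat), dtype=np.float32)
    edge_index = np.zeros((2, max_edges), dtype=np.int64)
    node_mask = np.zeros(max_nodes, dtype=bool)
    n_off, e_off = 0, 0
    for g in graphs:
        n, e = g["x"].shape[0], g["edge_index"].shape[1]
        x[n_off:n_off + n] = g["x"]
        # Shift edge endpoints by the node offset so graphs stay disjoint
        # inside the shared tensor.
        edge_index[:, e_off:e_off + e] = g["edge_index"] + n_off
        node_mask[n_off:n_off + n] = True
        n_off += n
        e_off += e
    # Caveat: unused edge slots remain (0, 0); a real implementation would
    # reserve a padding node so they cannot affect message passing.
    return x, edge_index, node_mask
```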